Add with_output version AppendAttention #3302
Conversation
Thanks for your contribution!
    self.causal,
    self.speculative_method is not None,
)[0]
if self.use_output:
Modifying only this file isn't enough; do a global search for every place that calls this append_attention.
OK.
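For context, a minimal sketch of the call-site pattern implied by the backend snippet above: when use_output is set, the caller pre-allocates fmha_out and hands it to the op instead of letting the op allocate it. This is illustrative only; append_attention_stub, the shapes, and the fmha_out keyword are assumptions, not FastDeploy's actual signature.

# Illustrative only: append_attention_stub stands in for the real custom op;
# names, shapes, and the fmha_out keyword are assumptions, not FastDeploy's API.
import paddle

def append_attention_stub(qkv, fmha_out=None):
    """Toy op: writes into a caller-provided buffer when one is given."""
    result = qkv * 2.0  # placeholder for the real attention computation
    if fmha_out is not None:
        paddle.assign(result, output=fmha_out)  # fill the pre-allocated buffer in place
        return (fmha_out,)
    return (result,)

use_output = True
qkv = paddle.randn([4, 16])

if use_output:
    # Pre-allocated output: its address stays fixed, which is what later
    # CUDA Graph capture relies on.
    fmha_out = paddle.empty([4, 16], dtype=qkv.dtype)
    out = append_attention_stub(qkv, fmha_out=fmha_out)[0]
else:
    out = append_attention_stub(qkv)[0]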
fastdeploy/model_executor/layers/attention/ops/append_attention.py (outdated, resolved)
Please also flesh out the PR description, explaining the background and goals of this change.
fastdeploy/model_executor/layers/attention/append_attn_backend.py (outdated, resolved)
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
…into append_attn_pr
Besides the two spots below that can be changed directly, the part where append_attention is declared as a custom operator also needs its Outputs updated. Since our append_attention no longer needs to output qkv_out, remove it:
PD_BUILD_STATIC_OP(append_attention)
    .Inputs({"qkv",
    ......
    .Outputs({"fmha_out", "qkv_out", "key_cache_out", "value_cache_out"})  // <--- this line
    .SetInplaceMap({{"key_cache", "key_cache_out"},

Change it to:

    .Outputs({"fmha_out", "key_cache_out", "value_cache_out"})
PS: When custom operator registration goes wrong, it always just throws this kind of exception; the main framework should add more context to these errors later.
terminate called after throwing an instance of 'std::bad_array_new_length'
  what():  std::bad_array_new_length
custom_ops/gpu_ops/cpp_extensions.cc (outdated)
const paddle::Tensor &decoder_tile_ids_per_batch,
const paddle::Tensor &decoder_num_blocks,
const paddle::Tensor &set_max_lengths, const paddle::Tensor &max_len_kv,
paddle::Tensor &res,
Suggested change:
-    paddle::Tensor &res,
+    paddle::Tensor &fmha_out,
.Attrs({"compute_type: std::string", | ||
"cache_quant_type: std::string", | ||
"use_neox_rotary_style: bool", | ||
"rope_3d: bool", | ||
"max_input_length: int", | ||
"quant_max_bound: float", | ||
"quant_min_bound: float", | ||
"out_linear_in_scale: float", | ||
"encoder_block_shape_q: int", | ||
"decoder_block_shape_q: int", | ||
"max_partition_size: int", | ||
"encoder_max_partition_size: int", | ||
"speculate_max_draft_token_num: int", | ||
"causal: bool", | ||
"speculate_decoder: bool", | ||
"rms_norm_eps: float"}) |
Move rms_norm_eps earlier in the order here.
.Attrs({"compute_type: std::string", | |
"cache_quant_type: std::string", | |
"use_neox_rotary_style: bool", | |
"rope_3d: bool", | |
"max_input_length: int", | |
"quant_max_bound: float", | |
"quant_min_bound: float", | |
"out_linear_in_scale: float", | |
"encoder_block_shape_q: int", | |
"decoder_block_shape_q: int", | |
"max_partition_size: int", | |
"encoder_max_partition_size: int", | |
"speculate_max_draft_token_num: int", | |
"causal: bool", | |
"speculate_decoder: bool", | |
"rms_norm_eps: float"}) | |
.Attrs({"rms_norm_eps: float", | |
"compute_type: std::string", | |
"cache_quant_type: std::string", | |
"use_neox_rotary_style: bool", | |
"rope_3d: bool", | |
"max_input_length: int", | |
"quant_max_bound: float", | |
"quant_min_bound: float", | |
"out_linear_in_scale: float", | |
"encoder_block_shape_q: int", | |
"decoder_block_shape_q: int", | |
"max_partition_size: int", | |
"encoder_max_partition_size: int", | |
"speculate_max_draft_token_num: int", | |
"causal: bool", | |
"speculate_decoder: bool", | |
}) |
LGTM
Background: tensor address management during CUDA Graph capture.
Goal: move the attention module's output allocation up front, so tensor addresses are easier to handle during CUDA Graph capture.
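A minimal, self-contained sketch of that pattern, assuming Paddle's paddle.device.cuda.graphs.CUDAGraph utility and a GPU build; attention_like is a stand-in for append_attention, not FastDeploy code, and whether capture succeeds depends on the Paddle build and device. The key point is that the output buffer is allocated once, outside the op, so its address is identical at capture time and at every replay.

# Sketch only: attention_like stands in for append_attention; the CUDAGraph
# usage assumes a GPU-enabled Paddle build.
import paddle
from paddle.device.cuda.graphs import CUDAGraph

paddle.set_device("gpu")

q = paddle.randn([8, 128])
k = paddle.randn([128, 128])

# Output buffer allocated once, up front; its address must not change
# between capture and replay.
fmha_out = paddle.zeros([8, 128])

def attention_like(q, k, out):
    # Stand-in for the attention op: writes into the caller-provided buffer.
    paddle.assign(paddle.matmul(q, k), output=out)

# Warm-up run outside capture (lazy initializations, workspace allocations).
attention_like(q, k, fmha_out)

graph = CUDAGraph()
graph.capture_begin()
attention_like(q, k, fmha_out)  # the fixed fmha_out address is baked into the graph
graph.capture_end()

# For a new step, copy fresh data into the captured input buffers in place,
# then replay; results land in fmha_out at the same address every time.
paddle.assign(paddle.randn([8, 128]), output=q)
graph.replay()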